ABSTRACT

CHAPTER ONE

Introduction

In the past year, we have studied the distributed computer system. The 
work that we have accomplished can be summarized as follows: 

(1) Top down design strategy. 

(2) Modelling and performance analysis of asynchronous concurrent systems. 

(3) Methods to detect deadlock in distributed systems. 

(4) A theory of software reliability based on the nature of the
input domain of the program.

Current approaches to the design and analysis of computer systems are 
based primarily on experience and intuition. The specification, design, 
implementation and evaluation of computer systems are very expensive, difficult 
to test adequately, slow to deploy and difficult to adapt to changing requirements. 
These difficulties have led to many schedule slippages and project failures.

This research attempts to develop a systematic approach for the design 
and analysis of computer systems. However, the activities involved are so wide 
and varied that only part of the full scope of the design and development process is 
studied. A top-down development approach for computer systems is developed.

The techniques for prediction and verification of the performance of 
asynchronous concurrent systems can be classified into two categories: (1)
deterministic models, and (2) probabilistic models. In deterministic models, it is
usually assumed that the task arrival times, the task execution times, and the
synchronization involved are known in advance of the analysis. With this
information, a very precise prediction of the system performance can be obtained.
This approach is very useful for performance evaluation of real time control
systems with hard deadline requirements.

In probabilistic models, the task arrival rates and the task service times
are usually specified by probability distribution functions. The synchronization
among tasks is usually not modelled, because otherwise the number of system
states becomes so large that analysis would be intractable.
Probabilistic models usually give a gross prediction of the performance of the
system and are good for early stages of system design when the system 
characteristics are not well understood. In this paper, we focus on performance 
analysis of real time systems and therefore we have chosen the deterministic 
approach. In particular, in order to model clearly the synchronization involved in 
concurrent systems, the Petri net model is chosen. 

In our approach, the system to be studied is first modelled by a Petri net. 
Based on the Petri net model, a given system is classified as either: (1) a
consistent system; or (2) an inconsistent system (the definitions are given in later 
sections of the paper). Most real-world systems fall into the first class and so we 
focus our discussion on consistent systems. Due to the difference in complexity 
involved in the performance analyses of different types of consistent systems, they 
are further subclassified into: (i) decision-free systems; (ii) safe persistent
systems; and (iii) general systems. Procedures for predicting and verifying the
system performance of all three types are presented. It is found that the
computational complexity involved increases in the same order as they are listed
above.

Our work in system deadlocks concentrates on analysis techniques for 
deadlocks in asynchronous concurrent systems. This includes multiprogrammed
systems, multiple processor systems and computer networks. In particular, we
study in detail deadlocks caused by conflicts in mutually exclusive accesses to
resources with the constraint that each resource type has only one member.
Deadlocks due to the erroneous nesting of binary semaphores (Dij 71), nesting of
critical regions (Bri 72, Bri 73a) and nesting of monitors (Bri 73b, Hoa 74) are
important members of the above category. In addition to these, deadlocks due to
conflicts in data file lockings in distributed database systems also fall into the 
above category. 

In order to facilitate the use of digital computers in critical, real-time 
control systems (e.g., nuclear power plant safety control systems), the software 
must be thoroughly validated. Software reliability is a measure of the confidence 
in the operational correctness of the software. Since the early 70's several 
software reliability models have been proposed. However, most of these models 
are ad hoc extensions of hardware reliability models and their assumptions have not 
been validated. 

This thesis first classifies and then evaluates several existing software 
reliability models according to some proposed criteria. Then it develops a theory 
of software reliability based on the nature of the input domain of the program, i.e.,
the size of the errors and the number, complexity and continuity of equivalence 
classes formed in the input domain. 

A general framework is developed for software reliability growth models 
used during the debugging phase of software development. It incorporates the 
concepts of residual error size and the testing process used. Two specific models 
are then developed. The first approach models the effect of debugging actions on 
the residual error size as a random walk process with a continuous state-space. 

The time between the detection (and correction) of successive errors is then
modelled as a doubly stochastic Poisson process. The application of this model and
its statistical evaluation are also discussed. The second approach is a Bayesian one
dealing with the prior and posterior distributions of the residual error size.

During the validation phase, the program is tested extensively in order to 
determine its reliability. Even if new errors are detected, they are not corrected.
A model is developed for directly estimating the correctness probability of the 
software based on the set of test cases used and on the number of equivalence 
classes, their complexity and continuity properties. The model is applied to a pilot 
program developed for nuclear power plant safety control systems. 

A predictive model, applicable during the operational phase, is developed 
based on the continuity of the input domain. An uncertainty measure using fuzzy
set theory is proposed for the operational software. The perturbation of the
residual error size due to different maintenance activities is also discussed.

The theory is then applied to the evaluation of software validation 
techniques and programming languages. Some language constructs and 
documentation techniques which can improve the reliability of the software are 
proposed. The application of the theory to different aspects of project 
management is also discussed. 

This report is divided into 6 chapters. The top-down design strategy is
presented in Chapter 2. Chapter 3 presents the techniques for performance
evaluation of concurrent systems. Chapter 4 develops the procedure for deadlock
detection in distributed systems. Chapter 5 presents the development of a theory
of software reliability based on the nature of the input domain of the
program. Lastly, Chapter 6 gives the conclusions of this report.

CHAPTER 2


A Top-Down Approach for the Development 
of Distributed Computer Systems 

2.1 Introduction 

In this chapter, a systematic approach for the development of distributed 
computer systems is discussed. The objective is to develop guidelines and 
automated tools for the design of distributed systems. The philosophy behind the 
approach is based on top-down hierarchical modelling of distributed computer systems (DCS).

2.2 The Top-Down Development Approach 

The top-down approach proposed here can be broken down into four 
successive phases (Figure 2.1): 

(1) Requirement and specification phase. 

(2) Design phase. 

(3) Implementation phase. 

(4) Evaluation and validation phase. 

The requirement and specification phase starts with some (possibly 
incomplete, vague, and informal) system requirements that approximate the 
desired system, and finishes when the modified and elaborated requirements have
been formally encoded and tested to the satisfaction of the system engineers and 
the "customers". 

The design phase starts with the requirement specifications and finishes 
when the system specifications are produced. The objective is to optimize and 
organize the system in a well-formed structure. It involves a hierarchy of
decomposition and partitioning of the system into subsystems. 

The implementation phase takes the system specification and develops the 
system architecture. It then maps the system functions into either hardware or
software functions. It is only at this step that physical constraints and technology
come into consideration.

The final step is the evaluation and validation of the system. This phase 
uses the bottom up validation approach. It takes the final design and ensures that 
the system meets the original requirements. This step uses both analytical 
modelling and simulation. 

2.2.1 Requirement and Specification Phase 

This phase consists of four major steps (Figure 2.2): (i) requirement 
elaboration, (ii) requirement specification and attribute formulation, (iii) process 
definition, and (iv) verification of requirements. 

2.2.1.1 Requirement Elaboration 

The requirement elaboration step can be considered as a problem
understanding stage. The objective is to give the requirement engineers a
bird's-eye view of the operations of the system.

2.2.1.2 Requirement Specification and Attribute Formulation 
(A) Requirement Specification

In real-world situations, the problems are so complex that pure 
mathematical formulation is usually impossible. The approach of using a 
specification language is chosen. 

A specification language is a syntactically and semantically well-defined
language, possibly intermixed with mathematical equations. Its whole purpose is to
provide an efficient and effective medium for defining the system requirements.
The language should be amenable to both static (hierarchical relationship, data
definition, etc.) and dynamic (control flow and data flow) analyses (Bel 76, Ham 76,
Pet 77).

(B) Attribute Formulation

For a distributed system, the attributes are cost, reliability, availability,
flexibility, expandability, reconfigurability, etc. Some of them are very difficult
to quantify. Usually, the situation is further complicated by the fact that the
system attributes are interdependent and may compete and
interact with each other. The designers are forced to consider design tradeoffs
early in the development process.

2.2.1.3 Process Definition 

The process definition step accepts inputs from the requirement 
specification and attribute formulation step and identifies major functions to be 
performed. First the input stimulus and the required responses are characterized. 

2.2.1.4 Verification of Requirements

In this step, the processes of the virtual system are verified to meet the
original users' requirements. As the system is developed hierarchically, the
specifications of one level are the requirements of the next level. To verify the
correctness of the virtual system, we only have to verify the consistency between
the specifications and requirements of consecutive levels.

2.2.2 Design Phase 

The design phase starts with the defined processes which are the output of 
the requirement and specification phase. The major steps involved in the design 
phase are decomposition and partitioning, functional specification and finally 
verification (Figure 2.3).

2.2.2.1 Decomposition and Partitioning 

In order for the design process to be manageable, it must be decomposed
and partitioned in such a way that most decisions can be made locally, based on
data available within a local area of the developing system specifications. To
achieve this, the system is decomposed into progressively more detailed
components which are then grouped into partitions (subsystems) to minimize the
amount of interaction between partitions. A graph theoretical approach for the
systematic decomposition and partitioning of a system is developed.

2.2.2.2 Functional Specification 

The next major step in the design phase is functional specification of the
partitioned processes. This functional specification is different from the process
specification described in the requirement specification phase. The objective of the
process specification is to define the interactions of the processes for the
decomposition step. The objective of the functional specification here is to define
the characteristics of the functions so as to enable optimization of the
function-to-processor mapping.

2.2.2.3 Verification of Design

No single model is powerful enough to have all the above features. The 
control flow of a distributed system can be modelled quite effectively by Petri nets,
the UCLA graph model, E-nets, etc. (Pet 77, Gos 71, Nee 73). These models represent
clearly the flow of information and control in a distributed system, especially those 
which exhibit asynchronous and concurrent properties. 

In order to predict and analyze the performance of the designed system, 
queuing models and simulation are often used (Per 78).

2.2.3 Implementation, Evaluation and Validation 

The implementation phase takes the virtual system and develops the 
system architecture. It then maps the system functions into either hardware or
software functions.

The final phase is the evaluation and validation of the system. This phase
uses the bottom-up validation approach. Both analytical modelling and simulation
will be used. Because of the hierarchical decomposition, each subsystem to be
analyzed should be small and therefore complexity should be low.

2.3 Summary

The development process is divided into four successive phases: (i)
requirement and specification phase; (ii) design phase; (iii) implementation phase;
and (iv) evaluation and validation phase. The first two phases are explored in
detail. The last two phases of the development process are only outlined briefly
because they are highly technology and architecture dependent.











CHAPTER 3 


Performance Evaluation of Asynchronous Concurrent Systems

3.1 Review of Petri Nets 

3.1.1 Basic Properties of Petri Nets 

Petri nets (PET 77, AGE 75) are a formal graph model for modelling the 
flow of information and control in systems, especially those which exhibit 
asynchronous and concurrent properties. 

3.1.2 Application of Petri Nets in Control Flow Analysis 

Petri nets have been used extensively to study the control flow of 
computer systems. By analyzing the liveness, boundedness and proper termination
properties of the Petri net model of a computer system, many desirable properties 
of the system can be unveiled. 

A Petri net is live (MAC 75, HOL 71) if there always exists a firing 
sequence to fire each transition in the net. By proving that the Petri net is live, 
the system is guaranteed to be deadlock free. 

A Petri net is bounded (KAR 66, LIE 76) if for each place in the net, there
exists an upper bound to the number of tokens that can be there simultaneously. If 
tokens are used to represent intermediate results generated in a system, by proving 
that the Petri net model of the system is bounded, the amount of buffer space 
required between asynchronous processes can be determined and therefore 
information loss due to buffer overflow can be avoided. If the upper bound on the 
number of tokens at each place is one, then the Petri net is safe. Programming 
constructs like critical regions (BRI 72) and monitors (BRI 73, HOA 74) can be
modelled by safe Petri nets. 

A Petri net is properly terminating (GOS 71, POS 74) if the Petri net
always terminates in a well-defined manner such that no tokens are left in the net.
By verifying that the Petri net is properly terminating, the system is guaranteed to
function in a well-behaved manner without any side-effects on the next initiation.

3.1.3 Extended Timed Petri Nets

In order to study the performance of a system, the Petri net model is 
extended to include the notion of time (RAM 74). In such extended nets, an 
execution time, r, is associated with each transition. When a transition initiates its 
execution it takes r units of time to complete its execution. With the extended 
Petri net model the performance of a computer system can be studied. 

3.2 Performance Evaluation 

The work that we have accomplished in performance evaluation is to use
Petri nets to find the maximum performance of the system, i.e., to find the
minimum cycle time (for processing a task) of the system. As pointed out before,
different computational complexities are involved in the analyses of systems of
different types. The approaches for analyzing each type of system are studied
separately in detail in the following sections. Before we come to the analyses, some
definitions are in order.

Definition. In a Petri net, a sequence of places and transitions, $P_1 t_1 P_2 t_2 \cdots P_n$, is a
directed path from $P_1$ to $P_n$ if transition $t_i$ is both an output transition of place $P_i$
and an input transition of place $P_{i+1}$ for $1 \le i \le n-1$.

Definition. In a Petri net, a sequence of places and transitions, $P_1 t_1 P_2 t_2 \cdots P_n$, is a
directed circuit if $P_1 t_1 P_2 t_2 \cdots P_n$ is a directed path from $P_1$ to $P_n$ and $P_1$ equals
$P_n$.

Definition. A Petri net is strongly connected if every pair of places is contained
in a directed circuit.

In this paper, we present the performance analysis techniques for
strongly connected non-terminating Petri nets. Extensions to analyze weakly
connected Petri nets are quite straightforward, so they are not discussed in this
report.

3.2.1 Consistent and Inconsistent Systems 

The first step involved in our approach to analyze the performance of a 
system is to model it by a Petri net. A system is a consistent (inconsistent) system 
if its Petri net model is consistent (inconsistent). A Petri net is consistent 
(condition A) if and only if there exists a non-zero integer assignment to its
transitions such that at every place, the sum of integers assigned to its input
transitions equals the sum of integers assigned to its output transitions; otherwise, 
the system is inconsistent. If a system is live and consistent, the system goes back 
to its initial configuration (state) after each cycle and then repeats itself. If a 
system is inconsistent, either it produces an infinite number of tokens (i.e., it needs 
infinite resources) or consumes tokens and eventually comes to a stop. 
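
Condition A can be illustrated with a minimal sketch that checks whether a candidate integer assignment balances every place. The function name and the dictionary layout are hypothetical, and "non-zero assignment" is read here as "every transition receives a non-zero integer", which is an assumption. The sketch only checks a given assignment; finding one amounts to solving a system of linear equations and is not shown.

```python
def satisfies_condition_a(places, assignment):
    """Check a candidate assignment against condition A.

    places     -- dict: place -> (input transitions, output transitions)
    assignment -- dict: transition -> integer (hypothetical layout)
    """
    # Assumption: every transition must receive a non-zero integer.
    if any(value == 0 for value in assignment.values()):
        return False
    for inputs, outputs in places.values():
        produced = sum(assignment[t] for t in inputs)
        consumed = sum(assignment[t] for t in outputs)
        if produced != consumed:
            return False
    return True


# Two places in a ring: t1 moves a token from A to B, t2 moves it back.
ring = {"A": (["t2"], ["t1"]), "B": (["t1"], ["t2"])}
print(satisfies_condition_a(ring, {"t1": 1, "t2": 1}))
print(satisfies_condition_a(ring, {"t1": 1, "t2": 2}))
```

In the ring example, firing each transition once returns the net to its initial marking, which matches the remark that a live and consistent system repeats itself after each cycle.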

3.2.2 Decision-free Systems 

A system is a decision-free system if its Petri net model is a
decision-free Petri net. A Petri net is decision-free if and only if for each place in
the net, there is one input arc and one output arc. This means that tokens at a
given place are generated by a predefined transition (its only input transition) and
consumed by a predefined transition (its only output transition).
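
The structural test in this definition is simple enough to sketch directly. The representation below (a list of place names plus a list of arcs as source/target pairs) is an assumption chosen for illustration, not a format used in the report.

```python
from collections import Counter


def is_decision_free(places, arcs):
    """Return True if every place has exactly one input arc and one output arc.

    places -- iterable of place names
    arcs   -- list of (source, target) pairs between places and transitions
    """
    place_set = set(places)
    in_count = Counter(dst for _, dst in arcs if dst in place_set)
    out_count = Counter(src for src, _ in arcs if src in place_set)
    return all(in_count[p] == 1 and out_count[p] == 1 for p in place_set)


ring_arcs = [("t1", "B"), ("B", "t2"), ("t2", "A"), ("A", "t1")]
print(is_decision_free(["A", "B"], ring_arcs))
# Adding a second consumer of place A introduces a decision.
print(is_decision_free(["A", "B"], ring_arcs + [("A", "t3")]))
```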

For a decision-free system, the maximum performance can be computed 
quite easily. 

Theorem 3.1. For a decision-free Petri net, the number of tokens in a circuit
remains the same after any firing sequence.

Definition. Let $S_i(n_i)$ be the time at which transition $t_i$ initiates its $n_i$-th
execution. The cycle time, $C_i$, of transition $t_i$ is defined as

$$C_i = \lim_{n_i \to \infty} \frac{S_i(n_i)}{n_i}.$$


Theorem 3.2. All transitions in a decision-free Petri net have the same cycle
time. 

Theorem 3.3. For a decision-free Petri net, the minimum cycle time (maximum
performance) C is given by

$$C = \max\left\{ \frac{T_k}{N_k} : k = 1, 2, \ldots, q \right\}$$

such that $S_i(n_i) = a_i + C n_i$, where

$T_k = \sum_{t_i \in L_k} r_i$ = sum of the execution times of the transitions in circuit k

$N_k = \sum_{P_i \in L_k} M_i$ = total number of tokens in the places in circuit k

$q$ = number of circuits in the net

$a_i$ = constant associated with transition $t_i$

$L_k$ = loop (circuit) k

$M_i$ = number of tokens in place $P_i$
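
As a worked illustration of the ratio $T_k / N_k$ in Theorem 3.3, the sketch below enumerates the simple circuits of a small decision-free net and takes the largest ratio. It relies on the fact that each place of a decision-free net has one input and one output transition, so a place can be written as an edge between two transitions. The function name, the data layout and the example net are hypothetical. Enumerating circuits is exponential in general, so this is only meant for small examples and is not the verification procedure developed in the text.

```python
from fractions import Fraction


def min_cycle_time(exec_time, places):
    """Largest T_k / N_k over all simple circuits.

    exec_time -- dict: transition -> integer execution time r_i
    places    -- list of (input transition, output transition, tokens)
    """
    succ = {t: [] for t in exec_time}
    for producer, consumer, tokens in places:
        succ[producer].append((consumer, tokens))
    best = None

    def walk(start, node, visited, total_time, total_tokens):
        nonlocal best
        for nxt, tokens in succ[node]:
            if nxt == start:
                n_k = total_tokens + tokens
                if n_k == 0:
                    raise ValueError("circuit without tokens: the net is not live")
                ratio = Fraction(total_time, n_k)
                if best is None or ratio > best:
                    best = ratio
            elif nxt not in visited and nxt > start:
                # Only visit larger names so each circuit starts at its smallest node.
                walk(start, nxt, visited | {nxt},
                     total_time + exec_time[nxt], total_tokens + tokens)

    for t in sorted(exec_time):
        walk(t, t, {t}, exec_time[t], 0)
    return best


times = {"t1": 3, "t2": 5, "t3": 2}
net = [("t1", "t2", 0), ("t2", "t1", 1), ("t1", "t3", 1), ("t3", "t1", 1)]
print(min_cycle_time(times, net))  # circuit t1-t2: 8/1, circuit t1-t3: 5/2
```

In this example the circuit through t1 and t2 is the bottleneck: it carries one token and needs 8 time units per cycle.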

We develop a very fast procedure to verify the performance of a system.

A procedure for verifying system performance

(1) Express the token loading in an nxn matrix, P, where n is the number of
places in the Petri net model of the system. Entry (A,B) in the matrix
equals x if there are x tokens in place A, and place A is connected directly
to place B by a transition. Matrix P of the example system in Fig 3.1 is
shown below:

[Matrix P]

(2) Express transition time in an nxn matrix, Q. Entry (A,B) in the matrix
equals $r_i$ (execution time of transition i) if A is an input place of
transition i and B is one of its output places. Entry (A,B) contains the
symbol "w" if A and B are not connected directly as described above.
Matrix Q for the example system is:

[Matrix Q]

(3) Compute matrix CP-Q (with $n - w = \infty$ for $n \in N$), then use Floyd's algorithm
(Flo 62) to compute the shortest distance between every pair of nodes
using matrix CP-Q as the distance matrix. The result is stored in matrix
S. There are three cases:

(a) All diagonal entries of matrix S are positive (i.e., $C N_k - T_k > 0$ for all
circuits) -- the system performance is higher than the given
requirement.

(b) Some diagonal entries of matrix S are zeros and the rest are positive
(i.e., $C N_k - T_k = 0$ for some circuits and $C N_k - T_k > 0$ for the other
circuits) -- the system performance just meets the given
requirement.

(c) Some diagonal entries of matrix S are negative (i.e., $C N_k - T_k < 0$ for
some circuits) -- the system performance is lower than the given
requirement.

In the example, for C = 15, CP-Q is

[Matrix CP-Q]

After applying Floyd's algorithm to find the shortest distance between every pair
of places we have:

[Matrix S, with rows and columns labelled A, B, C, D, E, F, G]

Since the diagonal entries are non-negative, the performance
requirement of C = 15 is satisfied. Moreover, since entries (A,A), (C,C), (E,E) and
(G,G) are zeros, C = 15 is optimal (i.e., it is the minimum cycle time). In addition,
when a decision-free system runs at its highest speed, $C N_k$ equals $T_k$ for the
bottleneck circuit. This implies that the places that are in the bottleneck circuit
will have zero diagonal entries in matrix S. In the example, the bottleneck circuit
is $A\,t_1\,C\,t_2\,E\,t_4\,G \ldots$ With this information, the system performance can be improved
by either reducing the execution times of some transitions in the circuit (by using
faster facilities) or by introducing more concurrency in the circuit (by introducing
more tokens in the circuit). Which approach should be taken is application
dependent and beyond the scope of this thesis.

The above procedure can be executed quite fast. The formulation of
matrices P and Q takes $O(n^2)$ steps. The Floyd algorithm takes $O(n^3)$ steps. As a
whole, the procedure can be executed in $O(n^3)$ steps. Therefore, the performance
requirement of a decision-free system can be verified quite efficiently.
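
The three steps of the procedure can be sketched as follows. The sketch builds the entries of CP-Q directly (unconnected pairs are set to infinity), runs Floyd's all-pairs shortest-path algorithm, and classifies the result from the diagonal. The function name, the argument layout and the two-place example are assumptions made for illustration; the example is not the system of Fig 3.1.

```python
import math


def verify_performance(places, transitions, marking, cycle_time):
    """Classify a decision-free net against a required cycle time C.

    places      -- list of place names
    transitions -- dict: name -> (input places, output places, execution time)
    marking     -- dict: place -> number of tokens
    """
    index = {p: i for i, p in enumerate(places)}
    n = len(places)
    # Entry (A, B) of CP - Q; pairs that are not directly connected stay infinite.
    dist = [[math.inf] * n for _ in range(n)]
    for inputs, outputs, exec_time in transitions.values():
        for a in inputs:
            for b in outputs:
                dist[index[a]][index[b]] = cycle_time * marking[a] - exec_time
    # Floyd's algorithm: shortest distance between every pair of places.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    worst = min(dist[i][i] for i in range(n))
    if worst > 0:
        return "higher"   # case (a)
    if worst == 0:
        return "meets"    # case (b)
    return "lower"        # case (c)


places = ["A", "B"]
transitions = {"t1": (["A"], ["B"], 3), "t2": (["B"], ["A"], 5)}
marking = {"A": 1, "B": 0}
for c in (10, 8, 6):
    print(c, verify_performance(places, transitions, marking, c))
```

The single circuit in the example has $T_k = 8$ and $N_k = 1$, so a requirement of C = 8 is met exactly, C = 10 is exceeded and C = 6 cannot be achieved.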

3.2.3 Safe Persistent Systems

A system is a safe persistent system if its Petri net model is a safe
persistent Petri net. A Petri net is a safe persistent Petri net if and only if it is a
safe Petri net and for all reachable markings, a transition is disabled only by firing
the transition. To compute the performance of the system, we first transform it
into a decision-free system and then use the algorithm discussed in the previous
subsection to compute the system performance.

A persistent Petri net can be transformed into a decision-free Petri net 
by tracing the execution of the system for one cycle. 

3.2.4 General Systems

A system is a general system if its Petri net model is a general Petri net.
A Petri net is a general Petri net if it is a consistent Petri net and there exists a
reachable marking such that the firing of a transition disables some other 
transitions. 

General systems are very difficult to analyze. In the next theorem, we 
show that it is unlikely that a fast algorithm exists to verify the performance of a 
general system. A method of computing the upper and lower bounds of the 
performance of a conservative general system (Lie 76) is proposed. For a 
non-conservative general system, no good heuristics are known to the authors and 
further research is needed. 

Theorem 3.4. Verifying the performance of a general Petri net is an NP-complete
problem (Kar 72). 




CHAPTER 4 


System Deadlock 

4.1 An Approach to Deadlock Prevention 

The scope of our study on system deadlock is restricted to systems using: 
(i) binary semaphores; (ii) critical regions; and (iii) monitors as their interprocess 
synchronization mechanisms. This enforces structural design and greatly reduces 
the computational complexities involved in the analyses. Based on the above 
synchronization constructs, a formal graph model (the request-possession graph) is 
developed to model deadlocks in these systems. The necessary and sufficient
conditions for the occurrence of a deadlock are derived. Based on these conditions,
techniques for uncovering potential deadlocks in a system are developed, and a
systematic approach for the construction of deadlock-free systems using critical
regions and/or monitors is proposed.

4.1.1 The Request-Possession Graph Model

A request-possession graph (an RP-graph) is a formal graph model
developed to study deadlocks in systems which use binary semaphores, critical
regions and/or monitors as their synchronization mechanisms. It is a directed
bipartite graph with two types of nodes and two types of arcs (Fig. 4.1b): (1)
resource reference nodes (which are called reference nodes for short in the rest of
the chapter), and (2) resource nodes. The reference nodes are used to represent
accesses of resources in a system and the resource nodes are used to represent
resources. A dotted arc directed from a reference node to a resource node
represents the request of the resource from the reference node. A solid arc
directed from a resource to a reference node represents the assignment of the
resource to the reference node.

The RP-graph of a program can be generated by scanning through the
program once. The procedure for constructing the RP-graph of a concurrent
system can be best illustrated by an example. Figure 4.1 shows a concurrent
system together with its RP-graph. For each binary semaphore (P or V operation)
in the system, there is a corresponding resource node (reference node) in the
RP-graph. For each P operation in the system, a dotted arc is drawn from the
corresponding reference node to the corresponding resource node. Solid arcs are
then drawn from the resource nodes to a reference node for the resources that
have been possessed by the process when it begins to execute the reference node.
For example, solid arcs are drawn from resource nodes a and b to reference node
V(b) of process X because both resources a and b are possessed by the process when
it begins to execute instruction V(b). Following the above procedure, the RP-graph
of a system can be constructed in time linear in the number of instructions in a
program. As the releases of resources will never bring a system into a deadlock,
the reference nodes corresponding to V operations are omitted in RP-graphs.
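
The single-scan construction described above can be sketched as follows. The program representation (each process as a list of P and V operations on named binary semaphores), the tuple used to identify a reference node and the function name are all assumptions made for illustration. The sketch also assumes that a process only releases semaphores it currently holds. Reference nodes for V operations are omitted, as in the text.

```python
def build_rp_graph(processes):
    """Build the arcs of an RP-graph in one pass over each process.

    processes -- dict: process name -> list of ("P" or "V", semaphore name)
    Returns (request_arcs, possession_arcs):
      request_arcs    -- dotted arcs, reference node -> resource node
      possession_arcs -- solid arcs, resource node -> reference node
    """
    request_arcs = []
    possession_arcs = []
    for name, operations in processes.items():
        held = []  # resources possessed when the next instruction begins
        for position, (op, semaphore) in enumerate(operations):
            if op == "P":
                reference = (name, position, semaphore)
                request_arcs.append((reference, semaphore))
                possession_arcs.extend((h, reference) for h in held)
                held.append(semaphore)
            elif op == "V":
                held.remove(semaphore)  # assumes the process holds it
            else:
                raise ValueError(f"unknown operation {op!r}")
    return request_arcs, possession_arcs


program = {
    "X": [("P", "a"), ("P", "b"), ("V", "b"), ("V", "a")],
    "Y": [("P", "b"), ("P", "a"), ("V", "a"), ("V", "b")],
}
requests, possessions = build_rp_graph(program)
print(requests)
print(possessions)
```

Each instruction is visited once, so the scan itself is linear in the number of instructions; the number of solid arcs added per reference node depends on how many resources are held at that point.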

4.1.2 The Necessary and Sufficient Conditions for Deadlocks 

The necessary condition for deadlocks developed in this section is
applicable to systems using binary semaphores, critical regions and/or monitors as
their synchronization mechanisms. The sufficient condition for deadlocks
developed is only applicable to systems using critical regions and/or monitors as
their synchronization mechanisms. This is due to the unstructured nature of
semaphores, as explained later in this section.

Definition. A system is safe if and only if it is deadlock-free. A system is unsafe
if and only if it potentially can get into a deadlock state. 

Theorem 4.1 gives only the necessary condition for an unsafe system. The
existence of a directed cycle in the RP-graph of a system does not imply that the
system is unsafe.

The RP-graph can be generated automatically in time linear in the
number of instructions in a system. The Floyd algorithm can be used to detect the
existence of a directed cycle in the RP-graph, which takes $O(n^3)$ steps,
where n is the number of nodes in the generated RP-graph. As a result, the
proposed algorithm can be executed in time polynomial in the number of
instructions in a system.
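
A minimal sketch of the cycle test is given below. It uses the reachability (transitive closure) form of the Floyd-Warshall scheme rather than shortest distances, which is a substitution made here for simplicity; a node lies on a directed cycle exactly when it can reach itself. The function name and the string labels in the example are hypothetical.

```python
def has_directed_cycle(arcs):
    """Return True if the directed graph given by (source, target) arcs has a cycle."""
    nodes = list({node for arc in arcs for node in arc})
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)
    reach = [[False] * n for _ in range(n)]
    for source, target in arcs:
        reach[index[source]][index[target]] = True
    # Triple loop over intermediate nodes k, as in Floyd's algorithm: O(n^3).
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return any(reach[i][i] for i in range(n))


# Dotted and solid arcs of a small RP-graph, merged into one arc list.
rp_arcs = [
    ("X:P(b)", "b"), ("b", "Y:P(a)"),
    ("Y:P(a)", "a"), ("a", "X:P(b)"),
]
print(has_directed_cycle(rp_arcs))
print(has_directed_cycle(rp_arcs[:-1]))
```

A cycle found this way only flags a potential deadlock, since the condition is necessary but not sufficient.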

Before we discuss the sufficient condition for a safe system, some
extensions have to be made to the RP-graph. The resultant model is called the
augmented request-possession graph (the ARP-graph). It is very similar to the
RP-graph except that each reference node, r, is given a set of names, $S_r$, such that
$s \in S_r$ if and only if:

(1) s is the name of the process when it begins to execute node r, or,

(2) s is the name of a resource that has been possessed by the process when it
begins to execute node r (i.e., there exists a solid directed arc from
resource s to node r in the RP-graph).

Theorem 4.2. A system is safe if and only if its ARP-graph does not contain a
directed cycle with distinct names on its reference nodes (i.e., $S_u \cap S_v = \emptyset$ for all
pairs of nodes, u and v, in the cycle).

Theorem 4.2 is true for a system which uses critical regions and monitors as its
synchronization mechanisms; however, it does not hold for a system that uses
semaphores as its synchronization mechanism. From this point onwards, when we
talk about systems, we mean systems which use critical regions and/or monitors as
their synchronization mechanisms.
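
The condition of Theorem 4.2 can be checked by brute force on small graphs, as in the sketch below. It searches for a simple directed cycle whose reference nodes carry pairwise disjoint name sets. The function name, the data layout and the two examples are hypothetical, and the search is exponential in the worst case, so it is only a sketch for small ARP-graphs and not a proposed algorithm.

```python
def has_distinct_name_cycle(arcs, names):
    """Return True if some directed cycle has pairwise disjoint name sets.

    arcs  -- list of (source, target) pairs of the ARP-graph
    names -- dict: reference node -> set of names; nodes missing from this
             dict are treated as resource nodes with no names
    """
    succ = {}
    for source, target in arcs:
        succ.setdefault(source, []).append(target)

    def extend(start, node, path, used):
        for nxt in succ.get(node, []):
            if nxt == start:
                return True
            if nxt in path:
                continue
            label = names.get(nxt, set())
            if label & used:
                continue  # shares a name with a node already on the path
            if extend(start, nxt, path | {nxt}, used | label):
                return True
        return False

    return any(extend(r, r, {r}, set(names[r])) for r in names)


# X enters region a then b; Y enters b then a.
arcs = [("Xa", "a"), ("Xb", "b"), ("Yb", "b"), ("Ya", "a"),
        ("a", "Xb"), ("b", "Ya")]
names = {"Xa": {"X"}, "Xb": {"X", "a"}, "Yb": {"Y"}, "Ya": {"Y", "b"}}
print(has_distinct_name_cycle(arcs, names))

# Same nesting, but both processes first enter a common outer region c.
arcs_c = [("Xc", "c"), ("Xa", "a"), ("Xb", "b"),
          ("Yc", "c"), ("Yb", "b"), ("Ya", "a"),
          ("c", "Xa"), ("c", "Xb"), ("a", "Xb"),
          ("c", "Yb"), ("c", "Ya"), ("b", "Ya")]
names_c = {"Xc": {"X"}, "Xa": {"X", "c"}, "Xb": {"X", "c", "a"},
           "Yc": {"Y"}, "Yb": {"Y", "c"}, "Ya": {"Y", "c", "b"}}
print(has_distinct_name_cycle(arcs_c, names_c))
```

In the second example the only cycle passes through two reference nodes that both carry the name c, so the cycle does not have distinct names: the outer region serializes the two processes.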

One application of Theorem 4.2 is to prove the safety of a system.

Before we use the theorem, we have to develop an effective procedure to
determine whether there exists a directed cycle with distinct labels on its nodes in
a labelled directed graph. However, it is shown in the following theorem that the
above problem is NP-complete (i.e., it is unlikely to have a fast algorithm to solve
the problem).

Theorem 4.3. It is NP-complete to find a directed cycle with distinct labels on its
nodes in a labelled directed graph. 

4.1.3 An Approach to the Design of Deadlock-Free Systems

Theorem 4.4. If all critical regions and/or monitors are linearly ordered, and all
processes enter a critical region or a monitor at a higher level before those at a
lower level, deadlock cannot occur.

The above strategy imposes severe constraints on the nesting of critical
regions and/or monitors. Two of its drawbacks are: (1) it reduces the concurrency
in a system; and (2) it reduces the transparency of a system. One approach to
remedy some of the drawbacks is to group critical regions and/or monitors into sets,
allowing unordered nesting within each set. A linear ordering is then imposed
among sets. A process must not enter a critical region or a monitor in a set at a
higher level after it has entered one in a set at a lower level. The linear ordering
among sets guarantees that deadlock cannot occur due to improper nesting of
critical regions or monitors in different sets. The deadlock-free condition within
each set is verified by the deadlock detection procedure discussed in section 4.1.1.
This approach provides: (1) good programming style; (2) a higher degree of
concurrency; (3) no run time overhead; and (4) automatic deadlock detection during
compilation.
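
The ordering rule among sets can be checked mechanically. The sketch below is illustrative only: it assumes that each critical region or monitor has been assigned the integer level of its set (the mapping level_of and the function name first_ordering_violation are hypothetical), and that a process is described by the order in which it enters nested regions, outermost first.

```python
from typing import Dict, List, Optional, Tuple


def first_ordering_violation(
    nesting: List[str], level_of: Dict[str, int]
) -> Optional[Tuple[str, str]]:
    """Check one process's nested entry sequence against the set ordering.

    Returns a pair (earlier, later) where `later` lies in a set at a higher
    level than the already-entered `earlier`, or None if the rule is obeyed.
    Regions in the same set (equal level) may nest in any order.
    """
    lowest = None  # region with the lowest level entered so far
    for region in nesting:
        if lowest is not None and level_of[region] > level_of[lowest]:
            return (lowest, region)
        if lowest is None or level_of[region] < level_of[lowest]:
            lowest = region
    return None
```

A sequence that passes this check still needs the deadlock detection procedure applied within each set, since nesting inside a set is unordered.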

4.2 Deadlock Detection in Distributed Data Bases

In a distributed data base, deadlocks can be detected quite easily by using
a centralized control strategy. Whenever a process locks or releases a data file, it
must obtain permission from a central control node. This control node maintains a
demand graph for the whole system and checks for deadlocks by searching for a
directed cycle in the graph. However, the approach is inefficient. All data
accesses have to obtain permission from the central control node although they
may not cause any deadlocks. This slows down the system, wastes the system
communication bandwidth and unnecessarily congests the communication
subsystem. Above all, if the control node goes down, it is very difficult to recover
the system from the failure.
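
The cycle search performed by the control node can be sketched with a depth-first traversal. This is a minimal illustration, assuming the demand graph is held as an adjacency mapping whose nodes are processes and data files; the function name find_cycle and the node names in the usage note are hypothetical.

```python
from typing import Dict, Hashable, List, Optional

WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished


def find_cycle(graph: Dict[Hashable, List[Hashable]]) -> Optional[List[Hashable]]:
    """Return the nodes of one directed cycle in the graph, or None."""
    colour: Dict[Hashable, int] = {}
    stack: List[Hashable] = []

    def visit(node):
        colour[node] = GREY
        stack.append(node)
        for nxt in graph.get(node, []):
            state = colour.get(nxt, WHITE)
            if state == GREY:
                # Back edge to a node on the current path closes a cycle.
                return stack[stack.index(nxt):]
            if state == WHITE:
                found = visit(nxt)
                if found is not None:
                    return found
        stack.pop()
        colour[node] = BLACK
        return None

    for node in list(graph):
        if colour.get(node, WHITE) == WHITE:
            found = visit(node)
            if found is not None:
                return found
    return None
```

For example, with edges P1 -> R2 -> P2 -> R1 -> P1 (a process points to the file it waits for, and a file points to the process holding it), the function returns the four nodes of the cycle.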

Another approach for deadlock detection is to store the resource status
locally at each site. Periodically, a node is chosen to be the control node.
Resource status is then sent from each site to the control node for analysis. This
remedies most of the drawbacks of the centralized approach. However, due to the
inherent communication delay, the chosen control node may get an inconsistent
view of the system, and it may draw a wrong conclusion.

We have developed three approaches to constructing a consistent demand
graph. In these approaches, it is assumed that each transaction is given a unique
name.

4.2.1 A Two Phase Deadlock Detection Protocol

In this protocol, each site maintains a status table for all resources that
are owned by the site. For each resource, the table keeps track of the transaction
that has locked the resource (if one exists) and the transactions which are waiting
for the resource (if they exist). Periodically, a node is chosen as the control. The
chosen control node performs the following operations:

(1) Broadcasts a message to all nodes in the system requesting them to send
their status tables and waits until all tables have been received.

(2) Constructs a demand graph for the system:

(a) If there is no directed cycle, the system is not in a deadlock and the
node releases its control.

(b) If there is a directed cycle, the node continues its execution.

(3) Broadcasts a second message to all nodes in the system requesting them
to send their status tables and waits until all tables have been received.

(4) Constructs a demand graph for the system using only transactions that are
reported in both the first and second reports:

(a) If there is no directed cycle, the system is not in a deadlock and the
node releases its control.

(b) If there is a directed cycle, the system is in a deadlock. The node
reports the deadlock situation to a deadlock resolver.

The above procedure uses a two phase commit protocol. By only using
transactions that are reported in both the first and the second status reports
in constructing the demand graph, a consistent system state is obtained. The main
advantage of this protocol is its simplicity. The drawback is the requirement of
two status reports from each site before a deadlock can be determined. In general,
the protocol is good for systems in which deadlocks occur only infrequently.
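
Step (4) of the protocol can be sketched as a set intersection followed by graph construction. The representation is an assumption made for illustration: each status report is taken to be a set of (transaction, resource, state) entries, with state "a" for a lock that has been assigned and "w" for a waiting request, and transaction and resource names are assumed not to collide. The name consistent_demand_graph is hypothetical.

```python
from typing import Dict, List, Set, Tuple

# (transaction, resource, state); state is "a" (assigned) or "w" (waiting).
Entry = Tuple[str, str, str]


def consistent_demand_graph(
    first: Set[Entry], second: Set[Entry]
) -> Dict[str, List[str]]:
    """Build a demand graph from the entries present in both status reports.

    An edge transaction -> resource means the transaction waits for the
    resource; an edge resource -> transaction means the transaction holds it.
    """
    graph: Dict[str, List[str]] = {}
    for txn, resource, state in sorted(first & second):
        if state == "w":
            graph.setdefault(txn, []).append(resource)
        else:
            graph.setdefault(resource, []).append(txn)
    return graph
```

An entry that appears in only one of the two reports, for instance a request that was withdrawn or granted between the broadcasts, is left out of the graph, so a cycle found in it is built only from entries that persisted across both rounds.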


4.2.2 A One Phase Deadlock Detection Protocol 

In this protocol, a deadlock is detected in one communication phase. Each
site maintains a resource status table for all local resources and a process status
table for all local processes. The resource status table keeps track of the
transactions that have locked a local resource and the transactions which are
waiting for a local resource. The process status table keeps track of the
transactions that are owned by processes local to the site. The system
operates according to the following rules:

(A) A process at site S requests a resource -- a transaction (S,t) is created,
where S is the site name and t is the time at which the transaction is
initiated. An entry (S,t,w) is put into the process status table of the site
indicating the transaction (S,t) is waiting for a resource. A message is
sent to acquire the resource.

(B) Site T receives a message that transaction (S,t) requests a resource local 
to T: 

(i) If the resource is free, the resource is assigned to the transaction and
a lock is set on the resource. An entry (S,t,a) is created in the
resource status table of the site and a message is sent to notify the
requesting process of the assignment.

(ii) If the resource is locked, an entry (S,t,w) is created in the
resource status table of the site and a message is sent to
acknowledge receipt of the request.

(C) Site S receives a resource assignment message for transaction (S,t) -- the
entry (S,t,w) in the process status table is changed to (S,t,a).

(D) Site S receives a request acknowledgement message for transaction (S,t)
-- do nothing.

(E) A process at site S releases a resource corresponding to transaction (S,t)
-- the entry (S,t,a) is removed from the process status table and a
message is sent to notify the release.

(F) Site T receives a resource release message corresponding to transaction (S,t) --
the resource is unlocked and the entry (S,t,a) is removed from the
resource status table.

Periodically, a node is chosen as the control. The chosen control node 
performs the following operations: 

(1) Broadcasts a message to all nodes in the system requesting them to send
their status tables and waits until all tables have been received.

(2) Constructs a demand graph for the system using only transactions for
which the resource status table agrees with the process status table (i.e.,
identical entries exist in both the resource status table and the process
status table).

(a) If there is no directed cycle, the system is not in a deadlock and the node
releases its control.

(b) If there is a directed cycle, the system is in a deadlock. The node reports
the deadlock situation to the deadlock resolver.

In order to show that the above protocol is correct, we have to prove that
the existence of a directed cycle in the constructed demand graph implies the
occurrence of a deadlock state.

Theorem 

A system is in a deadlock if and only if there is a directed cycle in the 
demand graph constructed by the above procedure. 
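
The agreement test in step (2) of the one phase protocol can be sketched as follows. The table layout is an assumption made for illustration: resource status tables are merged into a mapping from resource name to its (S,t,state) entries, and process status tables into a mapping from process name to the (S,t,state) entries of the transactions it owns. The name agreed_demand_graph is hypothetical.

```python
from typing import Dict, List, Set, Tuple

Entry = Tuple[str, int, str]  # (site, time, state); state is "a" or "w"


def agreed_demand_graph(
    resource_tables: Dict[str, Set[Entry]],
    process_tables: Dict[str, Set[Entry]],
) -> Dict[str, List[str]]:
    """Build a demand graph from entries on which both kinds of table agree.

    process -> resource means the process waits for the resource;
    resource -> process means the resource is assigned to the process.
    """
    # Map every process-table entry to the process that owns the transaction.
    owner: Dict[Entry, str] = {}
    for process, entries in process_tables.items():
        for entry in entries:
            owner[entry] = process

    graph: Dict[str, List[str]] = {}
    for resource, entries in sorted(resource_tables.items()):
        for entry in sorted(entries):
            process = owner.get(entry)
            if process is None:
                # No identical entry in a process status table: a message
                # is still in transit, so the transaction is left out.
                continue
            if entry[2] == "w":
                graph.setdefault(process, []).append(resource)
            else:
                graph.setdefault(resource, []).append(process)
    return graph
```

For example, if a resource has been assigned under rule (B)(i) but the assignment message has not yet been processed under rule (C), the resource table holds (S,t,a) while the process table still holds (S,t,w), and the sketch omits that transaction from the graph.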

4.2.3 A Hierarchical Deadlock Detection Protocol

In very large distributed data bases, it may be very costly to transfer all
status tables to one site. In particular, if the access pattern is very localized, it
will be of great advantage if deadlocks are detected locally. In these systems, one
approach is to group sites which are close to each other into a cluster.
Periodically, a node in a cluster is chosen to be the control. This control node
executes the one phase deadlock detection protocol and constructs a demand graph
for the cluster. The result obtained by the control node together with the
intercluster accesses (which should be relatively few) are then sent to a central
control node (which is also chosen dynamically). Based on this information, the
central control node constructs the demand graph of the whole system. In this
way, deadlocks within a cluster are detected by the control node of the cluster and
deadlocks among clusters are detected by the central control node.

Definition 

A transaction is a local (intercluster) transaction if and only if the
requesting process and the requested resource are in the same (different) cluster(s).

A Hierarchical Deadlock Detection Protocol 

(A) Periodically, a central control node is chosen. This node performs the
following operations:

(1) Chooses dynamically a control node for each cluster.

(2) Broadcasts a message to all control nodes requesting them to send
their status information and wait-for relations of the intercluster
transactions.

(3) Constructs a demand graph of the system using both the intercluster
transactions for which the resource status report agrees with the
process status report and the wait-for relations (which are defined
later) sent from the control nodes. If there is a directed cycle in the
demand graph, the system is in a deadlock; otherwise, the system is
not in a deadlock.

(B) Whenever a node receives a status report request from the central control
node, it performs the following operations:

(1) Broadcasts a message to all nodes in the cluster requesting them to
send their status tables and waits until all tables have been received.

(2) Constructs a demand graph for the cluster using only local
transactions for which the resource status table agrees with the
process status table.

(3) Computes the transitive closure of the demand graph. If there is a
directed cycle in the demand graph, the system is in a deadlock.

(4) Derives the wait-for relations from the transitive closure of the
demand graph. A process/resource is waiting for a process/resource
if and only if:

(a) The processes and/or the resources are in some intercluster
transactions.

(b) The process/resource is waiting directly or indirectly for the
process/resource (i.e., there is a directed arc pointing from the
process/resource to the process/resource in the transitive
closure of the demand graph).

(5) Sends the intercluster transaction status information and the wait-for
relations to the central control node.

The above concept can be extended into many levels. In this way, a
hierarchy of control nodes can be constructed. Due to the local access pattern of a
system, the amount of information that has to be sent from a child control node to
its parent can be greatly reduced.
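
Steps (B)(3) and (B)(4) of the hierarchical protocol can be sketched with a set-based form of Warshall's algorithm. This is an illustrative sketch rather than the report's implementation: the cluster demand graph is assumed to be an adjacency mapping, and the set boundary (a hypothetical name) is assumed to contain the processes and resources that take part in some intercluster transaction.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple


def transitive_closure(
    graph: Dict[Hashable, Iterable[Hashable]]
) -> Dict[Hashable, Set[Hashable]]:
    """Return, for every node, the set of nodes reachable by one or more arcs."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    reach = {u: set(graph.get(u, ())) for u in nodes}
    for k in nodes:  # Warshall: allow k as an intermediate node
        for i in nodes:
            if k in reach[i]:
                reach[i] |= reach[k]
    return reach


def wait_for_relations(
    graph: Dict[Hashable, Iterable[Hashable]], boundary: Set[Hashable]
) -> Tuple[bool, List[Tuple[Hashable, Hashable]]]:
    """Return (deadlock found in the cluster, wait-for pairs to report upward)."""
    reach = transitive_closure(graph)
    # A node that can reach itself lies on a directed cycle.
    deadlocked = any(u in reach[u] for u in reach)
    relations = sorted(
        (u, v)
        for u in boundary
        if u in reach
        for v in reach[u]
        if v in boundary and v != u
    )
    return deadlocked, relations
```

Only pairs whose endpoints both belong to boundary are reported, which reflects why the information passed from a cluster to the central control node stays small when most accesses are local.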








CHAPTER 5 


Software Reliability 

5.1 Classification of Software Reliability Models 

To analyze and further develop different reliability models, we first
classify them based primarily on the phase of the software life-cycle during which the
model is applicable, namely, the Testing and Debugging phase, the Validation phase,
and the Operation and Maintenance phase. During the Testing and Debugging phase, the
implemented software is tested and debugged. It is often assumed that the
correction of errors does not introduce any new errors. Hence, the reliability of
the program increases and, therefore, the models used during this phase are also
called reliability growth models. These models are mainly used to obtain a
preliminary estimate of the software reliability. However, software developed for
critical applications, like air traffic control, must be shown to have a high
reliability prior to actual use.

At the Validation phase, the software is subjected to
a large amount of testing in order to estimate the reliability. Errors found during
this phase are not corrected. In fact, if errors are discovered the software may be
rejected. The Nelson model (TRW 76) is based on statistical principles. The
software is tested with test cases having the same distribution as the actual
operating environment.

After the software has been thoroughly validated it is put
into operation. During use further errors may be detected or there may be user
demands for new features. These pressures result in maintenance activity (SWA 76,
SWA 79), i.e., modification of the software. The addition of new features results in
a growth in the size of the software. During the maintenance phase, the possible
activities are: error correction, addition of new features and improvements in
algorithms. Any of these activities can perturb the reliability of the system. The
new reliability can be estimated using the models for the validation phase.
However, it may be possible to estimate the change in the reliability using fewer
test cases by ensuring that the original features have not been altered. We do
not know of any existing software reliability models applicable during this phase.

Based on the above classifications, we develop a theory of software reliability
based on the nature of the input domain of the program, i.e., the size of the errors
and the number, complexity and continuity of equivalence classes formed in the
input domain.

5.2 Testing and Debugging Phase

The major assumption of all software reliability growth models is that
inputs are selected randomly and independently from the input domain according to
the operational distribution.

This is a very strong assumption and will not hold in general, especially
in the case of process control software where successive inputs are correlated in
time during system operation. To offset this disadvantage, the following models are
developed; they can be applied to any type of software, and their validity increases as
the size of the software and the number of programmers involved increase:

(A) Random Walk Model 

We can view the error size under operational inputs as a random
walk process in the interval (0,e). Each time the program is changed (due to error
corrections or other modifications) the error size changes. Let Z_j denote the time between
failures after the j-th change. Z_j is a random variable whose distribution depends
on the error size after the j-th change. We do not know anything about the random
walk process of the error size other than a sample of times between failures. Hence,
one approach is to construct a model for the error size and fit the parameters of the
model to the sample data. Then we assume that the future behaviour of the error size
can be predicted from the behaviour of the model.

(B) Bayesian Model 

An alternative approach is the Bayesian approach advocated by Littlewood
(LIT 79(B)). Here we postulate a prior distribution for the error size after each of
the changes 1, 2, ..., j. Then, based on the sample data, we compute the posterior
distribution of the error size after change j + 1.

5.3 Validation Phase

The well-known Nelson method, used during this phase, is based on the policy
that the test cases are selected randomly according to the operational distribution.
However, it suffers from a number of practical drawbacks:

(1) In order to have a high confidence in the reliability estimate, a large 
number of test cases must be used. 

(2) It does not take into account "continuity" in the input domain. For
example, if the program is correct for a test case, then it is likely that it is
correct for all test cases executing the same sequence of statements.

(3) It assumes random sampling of the input domain. Thus, it cannot take
advantage of testing strategies which have a higher probability of
detecting errors, e.g., boundary value testing, etc. Further, for most
real-time control systems, the successive inputs are correlated if the
inputs are sensor readings of physical quantities, like temperature, which
cannot change rapidly. In these cases we cannot perform random testing.

(4) It does not consider any complexity measure of the program, e.g., number
of paths, statements, etc. Generally, a complex program should be tested
more than a simple program for the same confidence in the reliability
estimate.

The approach we developed reduces the number of test cases required by exploiting
the nature of the input domain of the program. The input domain based approach
to the estimation of software reliability is: R = 1 - V_er, where V_er is the estimated
remaining error size. V_er can be determined by testing the program and locating and
estimating the size of errors found. In most cases this is simple since it is
relatively easy to find the inputs affected by a known error. If this cannot be done,
random sampling can be used to estimate the size of the error. It is expected that
software for critical applications will contain no known errors during this phase,
... reliability of the program given any input distribution by assuming some knowledge
about the error distribution in the input domain. Furthermore, we can generalize
this by considering the input distribution as well as the membership function of
each input element in probabilistic equivalence classes defined as fuzzy sets (ZAD
79).
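
As a small illustration of R = 1 - V_er, the sketch below assumes a finite input domain with a known input distribution and interprets the size of an error as the total probability of the inputs it affects. That interpretation, the function name reliability_from_errors and the data layout are assumptions made for this example, not definitions taken from the text.

```python
from typing import Dict, Hashable, List, Set


def reliability_from_errors(
    input_prob: Dict[Hashable, float], errors: List[Set[Hashable]]
) -> float:
    """Compute R = 1 - V_er for known errors over a finite input domain.

    Each error is given as the set of inputs it affects. Inputs hit by
    several errors are counted once.
    """
    affected: Set[Hashable] = set().union(*errors)
    v_er = sum(input_prob[x] for x in affected)
    return 1.0 - v_er
```

With four inputs of probability 0.5, 0.25, 0.125 and 0.125, and two overlapping errors that together affect the last two inputs, the remaining error size is 0.25 and the estimate is R = 0.75.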

5.5 Applications

(a) PROGRAMMING LANGUAGE DESIGN: The reliability of a program
depends greatly on the language in which it is coded. For example, many more
errors will be introduced in a program coded in a Machine Level Language (MLL)
than in a program coded in a High Level Language (HLL). This is the basis of the
concept of language level introduced by Halstead (HAL 77). However, this criterion
only considers the difference between the volumes of the programs produced by
using different languages. Here we propose a different measure of the goodness of
a programming language. The criterion is qualitative and is based on the size of
possible errors which a programmer can commit. A good programming language
construct is one which maximizes the chance of detecting an error when it occurs,
i.e., it increases the size of likely errors. We consider all methods of validating the
software, including code reading, static analysis, dynamic analysis and testing.

(b) PROJECT MANAGEMENT: Another important application of software
reliability theory is in project management. One obvious use of the reliability
measure is as a criterion for the acceptance or rejection of the software. Besides,
software reliability can be applied to the scheduling of testing when several
different strategies can be used. The analysis is based on the concept of efficiency
of testing strategies. We develop a probabilistic model which determines a test
case selection strategy in order to minimize some cost criterion. The cost could be
the amount of time required to develop a test case using a particular strategy.
This is useful for control systems' software since the requirement of high
confidence in the reliability estimate implies a large amount of testing. The model
also specifies the optimal distribution of test cases over the various modules
constituting the program. For example, simple modules should be tested less than
complex error-prone modules.









Future Work 


One future research area is to develop the criteria for grouping critical
regions and monitors into sets so as to minimize the amount of "unstructuredness"
created by the grouping. Another future research area is to develop faster
deadlock detection procedures. Although it has been proven that determining the
safety of a system is NP-complete, the computational complexity can be
polynomial in the size of the program if some parameters are fixed.

One issue which we have not dealt with is the estimation of the overall
hardware/software system reliability. The combination of hardware and software
reliability estimates is discussed in (BUN 80, KEE 76, KLI 80, THO 80). However,
the approach generally advocated is to assume that hardware and software failures
are independent, so that the overall system reliability is the product of the
software and hardware reliability estimates. This is unsatisfactory since it is
possible for the software to rectify hardware failures and vice versa. For example,
the failure of a line printer need not be a system failure if the software can
redirect the output to another device. A viable approach is to view the overall
system as being similar to a man-machine system.

For complex systems, methods must be developed for estimating the 
design correctness of the hardware. Thus, failures can be due to software errors, 
hardware component break-downs or hardware design errors. The applicability of 
software reliability growth models discussed above to estimating the design 
correctness needs to be investigated. 

Another important research area is developing techniques for validating
software reliability models. At present the models are applied to some project
data and their validity is deduced from the results. This is not satisfactory since
very few sets of actual data are available. Further, the models make some
assumptions which may not hold for the particular project. For example, most
software reliability growth models assume that the testing process is the same as
the operational environment, which is not true in general. In this thesis we have
adopted a deductive approach coupled with several experiments. Also, we have
derived auxiliary results (e.g., the optimal set of test cases) which seem
reasonable. Further, we have developed an independent way of validating each
model, namely, the determination of the error size for the stochastic model and
the error seeding approach to estimating the correctness probability for the
theoretical model we developed.